81 research outputs found

    Hardware acceleration of the trace transform for vision applications

    Computer Vision is a rapidly developing field in which machines process visual data to extract meaningful information. Digitised images, as raw pixels and bits, serve no purpose on their own; it is only by interpreting the data and extracting higher-level information that a scene can be understood. The algorithms that enable this process are often complex and data-intensive, limiting the processing rate when implemented in software. Hardware-accelerated implementations provide a significant performance boost that can enable real-time processing. The Trace transform is a newly proposed algorithm that has been proven effective in image categorisation and recognition tasks. It is flexibly defined, allowing the mathematical details to be tailored to the target application. However, it is highly computationally intensive, which limits its applications. Modern heterogeneous FPGAs provide an ideal platform for accelerating the Trace transform for real-time performance, while also allowing an element of flexibility, which suits the generality of the Trace transform well. This thesis details the implementation of an extensible Trace transform architecture for vision applications, before extending this architecture to a fully flexible platform suited to the exploration of Trace transform applications. As part of the work presented, a general set of architectures for large-windowed median and weighted median filters is presented, as required by a number of Trace transform implementations. Finally, an acceleration of Pseudo 2-Dimensional Hidden Markov Model decoding, usable in a person detection system, is presented. Such a system can be used to extract frames of interest from a video sequence, to be subsequently processed by the Trace transform. All these architectures emphasise the need for considered, platform-driven design in achieving maximum performance through hardware acceleration.
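
    For reference, the Trace transform applies a functional to the image along lines parameterised by angle and signed distance from the image centre; when that functional is a simple summation, the result reduces to a sampled Radon transform. The sketch below is a minimal software reference model of this idea, with illustrative sampling and parameter choices; it is not the hardware architecture described in the thesis.

import numpy as np

def trace_transform(img, n_angles=180, functional=np.sum):
    # Software reference model: trace lines over the image, parameterised by
    # angle phi and signed distance p from the centre, and apply the chosen
    # trace functional along each line. With functional=np.sum this reduces
    # to a sampled Radon transform; other functionals (max, median, weighted
    # sums) yield different Trace transforms.
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = int(np.ceil(np.hypot(h, w) / 2))
    dists = np.arange(-radius, radius + 1)      # distance of line from centre
    ts = np.arange(-radius, radius + 1)         # position along the line
    out = np.zeros((n_angles, dists.size))
    for ai, phi in enumerate(np.linspace(0, np.pi, n_angles, endpoint=False)):
        nx, ny = np.cos(phi), np.sin(phi)       # unit normal of the line
        dx, dy = -ny, nx                        # unit direction of the line
        for di, p in enumerate(dists):
            xs = np.rint(cx + p * nx + ts * dx).astype(int)
            ys = np.rint(cy + p * ny + ts * dy).astype(int)
            keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
            samples = img[ys[keep], xs[keep]]
            out[ai, di] = functional(samples) if samples.size else 0.0
    return out

    Calling trace_transform(image, functional=np.median) corresponds to a median-based trace functional, which is consistent with the need for the large-windowed median filter architectures mentioned above.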

    Design abstraction for autonomous adaptive hardware systems on FPGAs

    Adaptive hardware is gaining importance with the emergence of more autonomous systems that must process large volumes of sensor data and react within tight deadlines. To support such computation within the constraints of embedded deployments, a blend of high-throughput hardware processing and adaptive control is required. FPGAs offer an ideal platform for implementing such systems by virtue of their hardware flexibility and sensor-interfacing capabilities. FPGA SoCs are particularly well suited, offering capable embedded processors tightly coupled with a flexible, high-performance FPGA fabric. This paper explores existing work on adaptive hardware systems before proposing a general model and implementation approach tailored towards these modern FPGA architectures, concluding with pointers for research in this emerging field.

    Square-rich fixed point polynomial evaluation on FPGAs

    Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, involves the fewest operations but can lead to increased latency due to its serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of a large hardware overhead. This paper presents an efficient polynomial evaluation algorithm, which reformulates the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than a general multiplier, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied to fixed-point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead when evaluating 5th-degree polynomials.
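
    As a point of reference for the latency argument, the two baseline schemes can be sketched in plain software as below (coefficients in ascending order). This is not the square-rich reformulation proposed in the paper, only the standard algorithms it is compared against.

def horner(coeffs, x):
    # Horner's rule: a0 + x*(a1 + x*(a2 + ...)). Fewest operations, but every
    # multiply-add depends on the previous one, so the critical path (and
    # hence latency) grows linearly with the degree.
    acc = 0.0
    for c in reversed(coeffs):          # highest-degree coefficient first
        acc = acc * x + c
    return acc

def estrin(coeffs, x):
    # Estrin's method: pair adjacent terms and repeatedly square x, so each
    # level of the reduction tree can be evaluated in parallel, at the cost
    # of extra multipliers.
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)           # pad to an even number of terms
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power           # squaring step reused at every level
    return terms[0]

assert abs(horner([1, 2, 3, 4], 0.5) - estrin([1, 2, 3, 4], 0.5)) < 1e-12

    The square-rich scheme described in the paper goes further by restructuring the evaluation so that more of the multiplications become squarings, which map to the cheaper dedicated squarer hardware.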

    High throughput spatial convolution filters on FPGAs

    Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because of the inherent parallelism in these computations, which can be readily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks such as the DSP blocks found in modern devices. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study of spatial convolution filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for 4K-resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility.
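
    To put the throughput figures in context, a direct software model of a spatial (2-D) convolution and the pixel rates implied by 4K video are sketched below. The 3840x2160 frame size is an assumption about what "4K" denotes here, and the kernel is applied without flipping (cross-correlation), as is common for image filters.

import numpy as np

def conv2d(frame, kernel):
    # Direct 2-D spatial filter reference model (valid region only): each
    # output pixel is K*K multiply-accumulates, which is the work a
    # DSP-block-based FPGA datapath must sustain at one output per cycle.
    kh, kw = kernel.shape
    h, w = frame.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(frame[y:y + kh, x:x + kw] * kernel)
    return out

# Sustained pixel rates implied by 4K video (assuming 3840x2160 frames):
pixels_per_frame = 3840 * 2160
for fps in (30, 60):
    rate = pixels_per_frame * fps / 1e6
    print(f"{fps} FPS -> about {rate:.0f} Mpixel/s, i.e. one output per "
          f"clock cycle at roughly {rate:.0f} MHz, or equivalent parallelism")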

    ZyCAP: efficient partial reconfiguration management on the Xilinx Zynq

    New hybrid FPGA platforms that couple processors with a reconfigurable fabric, such as the Xilinx Zynq, offer an alternative view of reconfigurable computing in which software applications leverage hardware resources through frequently reconfigured accelerators. For this to be feasible, reconfiguration overheads must be reduced so that the processor is not burdened with managing the process. We discuss partial reconfiguration (PR) on these architectures, and present an open-source controller, ZyCAP, that overcomes the limitations of existing methods, offering more effective use of hardware resources in such architectures. ZyCAP combines high-throughput configuration with a high-level software interface that frees the processor from detailed PR management, making PR on the Zynq easy and efficient.
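
    The scale of the overhead being targeted can be illustrated with simple arithmetic: reconfiguration time is roughly the partial bitstream size divided by the configuration throughput. The sizes and throughputs below are illustrative assumptions only, not measured ZyCAP or Zynq figures.

def reconfig_time_ms(bitstream_bytes, throughput_mb_per_s):
    # Latency of loading a partial bitstream at a given configuration
    # throughput (illustrative model; ignores driver and setup overheads).
    return bitstream_bytes / (throughput_mb_per_s * 1e6) * 1e3

partial_bitstream = 400 * 1024          # assumed 400 KiB partial bitstream
for label, mbps in [("processor-driven transfer (assumed)", 50),
                    ("DMA-driven PR controller (assumed)", 400)]:
    print(f"{label}: {reconfig_time_ms(partial_bitstream, mbps):.2f} ms")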

    Minimizing DSP block usage through multi-pumping

    Resource sharing in the mapping of an algorithm to an architecture allows the same resource to be scheduled for different uses in different cycles, generally at the cost of an increased schedule length. Multi-pumping is a method whereby a resource is clocked at a frequency that is a multiple of that of the surrounding circuit, thereby offering multiple executions per global clock cycle, and therefore allowing sharing within the same cycle. This concept maps well to FPGA architectures, where hard macro blocks are typically capable of running at higher frequencies than standard logic. While this technique has been demonstrated for multipliers, modern DSP blocks are more complex, with multiple computational nodes. In this paper, we apply multi-pumping to minimise DSP block usage while taking advantage of the multiple nodes these blocks support. The proposed approach uses, on average, 39% fewer DSP blocks, at a cost of 19% more LUTs and 7% more registers.
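
    The core idea can be sketched behaviourally: if the shared block runs at twice the system clock, two operations can be issued to it within a single system-clock cycle. The model below is a purely illustrative software analogy of that time-multiplexing, not the paper's scheduling flow.

def multipumped_multiplier(system_cycles):
    # Behavioural analogy of a 2x multi-pumped multiplier: per system-clock
    # cycle, the hard block sees two fast-clock phases, so it can serve two
    # independent multiplications that would otherwise need two multipliers.
    results = []
    for (a0, b0), (a1, b1) in system_cycles:
        phase0 = a0 * b0                # first fast-clock phase
        phase1 = a1 * b1                # second fast-clock phase
        results.append((phase0, phase1))
    return results

# Two multiplications per system cycle, served by one (modelled) multiplier:
print(multipumped_multiplier([((3, 4), (5, 6)), ((7, 8), (9, 2))]))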

    Efficient multi-standard cognitive radios on FPGAs

    Cognitive radios that support multiple standards and modify their operation depending on environmental conditions are becoming more important as the demand for higher bandwidth and efficient spectrum use increases. Traditional implementations in custom ASICs cannot support such flexibility, with standards changing at an ever faster pace, while software baseband implementations fail to achieve the required performance. Hence, FPGAs offer an ideal platform, bringing together flexibility, performance, and efficiency. This work explores possible techniques for designing multi-standard radios on FPGAs, and examines how partial reconfiguration can be leveraged in a way that is amenable to domain experts with minimal FPGA knowledge.

    FPGA dynamic and partial reconfiguration: a survey of architectures, methods, and applications

    Dynamic and partial reconfiguration are key differentiating capabilities of field-programmable gate arrays (FPGAs). While they have been studied extensively in academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose and the properties of modern commercial architectures. We then investigate design flows and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, and propose new areas where this capability places FPGAs in a unique position for adoption.

    Multipumping flexible DSP blocks for resource reduction on Xilinx FPGAs

    For complex datapaths, resource sharing can help reduce area consumption. Traditionally, resource sharing is applied when the same resource can be scheduled for different uses in different cycles, often resulting in a longer schedule. Multipumping is a method whereby a resource is clocked at a frequency that is a multiple of that of the surrounding circuit, thereby offering multiple executions per global clock cycle. This allows a single resource to be shared among multiple uses in the same cycle. This concept maps well to modern field-programmable gate arrays (FPGAs), where hard macro blocks are typically capable of running at higher frequencies than most designs implemented in the logic fabric. While this technique has been demonstrated for static resources, modern digital signal processing (DSP) blocks are flexible, supporting varied operations at runtime. In this paper, we demonstrate multipumping for resource sharing of the flexible DSP48E1 macros in Xilinx FPGAs. We exploit their dynamic programmability to enable resource sharing for the full set of supported DSP block operations, and compare this to multipumping only multipliers and DSP blocks with fixed configurations. The proposed approach saves on average 48% of DSP blocks at a cost of 74% more LUTs, effectively saving 30% of equivalent LUT area, and is feasible for the majority of designs, whose clock frequencies are typically below half the maximum supported by the DSP blocks.
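
    A behavioural sketch of what "flexible" adds: because the DSP block's operation can be selected at runtime, the two fast-clock phases within one system cycle may even carry different operations. The opcode names below are illustrative stand-ins, not actual DSP48E1 OPMODE encodings.

def dsp_op(op, a, b, c=0):
    # Simplified behavioural model of a runtime-flexible DSP block; the
    # operation set and names are illustrative, loosely mirroring the modes
    # a DSP48E1 can select dynamically.
    if op == "mult":
        return a * b
    if op == "mult_add":
        return a * b + c
    if op == "add":
        return a + c
    raise ValueError(f"unsupported op: {op}")

def multipump_flexible(system_cycles):
    # One shared block, two fast-clock phases per system-clock cycle, each
    # phase free to use a different operation.
    return [(dsp_op(*phase0), dsp_op(*phase1))
            for phase0, phase1 in system_cycles]

# e.g. a multiply and a multiply-add sharing the same block in one cycle:
print(multipump_flexible([(("mult", 3, 4), ("mult_add", 5, 6, 7))]))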